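The Most Authoritative Evaluation of Metagenomics Software to Date: Benchmark Data Sets of Artificially Constructed Metagenomes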

Original 2017-09-11 Liu Yongxin (刘永鑫), Plant Microbiome (植物微生物组)

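Preface

Metagenomics has advanced rapidly in recent years, but because the samples under study are mixtures of hundreds to thousands of species, it still faces three major challenges: assembling genomes from highly complex mixtures of species, binning the mixed sequences to reconstruct individual genomes, and taxonomically classifying and annotating those genomes.
A large number of tools have appeared for all three tasks, but without an evaluation framework built on standard samples, their strengths, weaknesses, and scope of application have never been assessed systematically, and users find it very hard to choose among them.
The study introduced today was led by the group of Prof. Alice C McHardy at the Helmholtz Centre for Infection Research in Braunschweig, Germany. It established reference samples and data sets containing more than 1,300 known microbial genomes, which now serve as the gold standard for software evaluation in this field. Its systematic assessment of current software gives users important guidance in choosing and using tools, and it can also help developers further optimize software and algorithms. A total of 45 research institutions took part in the study, and our group contributed to part of the work of building the reference set.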

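The paper has been accepted by Nature Methods but is not yet published online. The preprint was posted on bioRxiv on June 12, 2017; as of September 10, the abstract had been viewed 8,641 times, the PDF had been downloaded 3,784 times, and Google Scholar counted 10 citations.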

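Abstract

In metagenome analysis, computational methods for assembly, binning, and taxonomic annotation are essential for interpreting the data biologically downstream. However, there has been no common standard data set for evaluating how these methods perform. Software developers around the world need data sets that are both complex and realistic to serve as a benchmark.
The CAMI (Critical Assessment of Metagenome Interpretation) benchmark metagenomes were generated from 700 newly sequenced microorganisms and 600 novel viruses and plasmids. They include genomes with varying degrees of relatedness to each other and to publicly available genomes, and they represent common experimental setups.
Evaluated against these reference data sets, most assembly and genome binning programs performed well at reconstructing individual genomes at the species level, but were substantially affected by the presence of closely related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, but their performance dropped quickly below the family level. Parameter settings also had a strong effect on the results and so determine whether results can be reproduced.
CAMI not only highlights the current challenges in computational metagenomics, but also offers guidance on choosing software for specific scientific questions.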

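Figure 1. Genome coverage achieved by each assembler

Boxplots show, for each assembler's output on the high complexity metagenome data set, the distribution of the fraction of each reference genome that was covered. (a) All genomes; (b) genomes with an average nucleotide identity (ANI) of at least 95%; (c) genomes with an ANI below 95%. The same color marks the same assembler run in a different pipeline or with different parameter settings.
(d) Fraction of each genome assembled versus sequencing depth. Genomes are split into unique genomes (ANI < 95%, brown), genomes with related strains present (ANI >= 95%, blue), and high copy circular elements (green). The gold standard includes every genomic region covered by at least one read in the metagenome data set, so the assembled fraction for low abundance genomes can be below 100%.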

Figure 1: Boxplots representing the fraction of reference genomes assembled by each assembler for the high complexity data set. (a): all genomes, (b): genomes with ANI >=95%, (c): genomes with ANI < 95%. Coloring indicates the results from the same assembler incorporated in different pipelines or with other parameter settings. (d): genome recovery fraction versus genome sequencing depth (coverage) for the high complexity data set. Data were classified as unique genomes (ANI < 95%, brown color), genomes with related strains present (ANI >= 95%, blue color) and high copy circular elements (green color). The gold standard includes all genomic regions covered by at least one read in the metagenome dataset, therefore the genome fraction for low abundance genomes can be less than 100%.
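To make the two quantities in Figure 1 concrete, here is a minimal Python sketch of how a genome fraction and the ANI-based grouping could be computed. It is an illustration only, not the evaluation code used in the study: the function names, the interval-based input, and the use of the full genome length as the denominator (rather than the read-covered gold standard described in the caption) are all assumptions.

```python
from typing import List, Tuple

ANI_THRESHOLD = 95.0  # percent; the cutoff quoted in the Figure 1 caption


def genome_fraction(genome_length: int, aligned: List[Tuple[int, int]]) -> float:
    """Fraction of reference positions covered by at least one contig alignment.

    `aligned` holds half-open (start, end) intervals on the reference genome.
    Assumption: the denominator is the full genome length, not the
    read-covered gold standard used in the paper.
    """
    covered = 0
    current_end = 0  # rightmost reference position counted so far
    for start, end in sorted(aligned):
        start = max(start, current_end)  # skip the part already counted
        end = min(end, genome_length)    # clip alignments that overrun the genome
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


def strain_group(max_ani_to_others: float) -> str:
    """Label a genome by its highest ANI to any other genome in the data set."""
    if max_ani_to_others >= ANI_THRESHOLD:
        return "related strains present"
    return "unique"
```

Merging overlapping alignments before counting matters here, because summing raw alignment lengths would let a repeatedly assembled region push the fraction above 1.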

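Figure 2. Purity and completeness of genomes reconstructed by binning

Average purity (x-axis) and completeness (y-axis), with standard errors, of genomes reconstructed by genome binners: (a) genomes of unique strains (ANI < 95%) and (b) common strains (ANI > 95%). In each analysis, the smallest bins, together making up 1% of the data set, were removed. (c) Number of genomes recovered at different completeness and contamination thresholds. (d) Adjusted Rand Index (ARI, x-axis) against the fraction of the sample that was binned (y-axis). The ARI ignores unassigned sequences, so it reflects accuracy only for the portion of the data that was assigned. (e, f) Taxonomic binning performance on the complete data set (e) and after removing the smallest bins summing to 1% of the data (f). Shaded areas show the standard error.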

Figure 2: Average purity (x-axis) and completeness (y-axis) and their standard errors (bars) for genomes reconstructed by genome binners; for genomes of unique strains with equal to or less than 95% ANI to others (a) and common strains with more than 90% ANI to each other (b). For each program and complexity dataset, the submission with the largest sum of purity and completeness is shown (Supplementary Tables 1, 10, 12). In each case, small bins adding up to 1% of the data set size were removed. (c) Number of genomes recovered with varying completeness and contamination (1-purity, Supplementary Table S17). (d) The Adjusted Rand Index (ARI, x-axis) in relation to fraction of the sample assigned (in basepairs) by the genome binners (y-axis). The ARI was calculated excluding unassigned sequences, thus reflects the assignment accuracy for the portion of the data assigned. (e,f) Taxonomic binning performance metrics across ranks for the medium complexity data set, with (e) results for the complete data set and (f) with smallest predicted bins summing up to 1% of the data set removed. Shaded areas indicate the standard error of the mean in precision (purity) and recall (completeness) across taxon bins.
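The binning metrics named in this caption can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the CAMI evaluation code: the input layout (`contigs` mapping a contig id to its predicted bin, true genome, and length) is hypothetical, each bin is scored against the genome that contributes most of its base pairs, and the ARI is counted per contig rather than weighted by base pairs.

```python
from collections import Counter, defaultdict
from math import comb
from typing import Dict, Tuple

# Hypothetical input: contig id -> (predicted bin, true genome, length in bp)
Assignment = Dict[str, Tuple[str, str, int]]


def drop_smallest_bins(contigs: Assignment, fraction: float = 0.01) -> Assignment:
    """Remove the smallest bins whose combined size fits within `fraction` of all binned bp."""
    bin_size: Counter = Counter()
    for bin_id, _, length in contigs.values():
        bin_size[bin_id] += length
    budget = fraction * sum(bin_size.values())
    dropped = set()
    for bin_id, size in sorted(bin_size.items(), key=lambda kv: kv[1]):
        if size > budget:
            break
        budget -= size
        dropped.add(bin_id)
    return {c: v for c, v in contigs.items() if v[0] not in dropped}


def purity_completeness(
    contigs: Assignment, genome_size: Dict[str, int]
) -> Dict[str, Tuple[float, float]]:
    """Per-bin (purity, completeness) in bp, scored against the bin's dominant genome."""
    per_bin = defaultdict(Counter)
    for bin_id, genome, length in contigs.values():
        per_bin[bin_id][genome] += length
    scores = {}
    for bin_id, counts in per_bin.items():
        genome, true_bp = counts.most_common(1)[0]
        purity = true_bp / sum(counts.values())        # contamination = 1 - purity
        completeness = true_bp / genome_size[genome]
        scores[bin_id] = (purity, completeness)
    return scores


def adjusted_rand_index(contigs: Assignment) -> float:
    """Pair-counting ARI between bins and genomes, over assigned contigs only.

    Assumes at least two contigs and that bins and genomes are not both trivial
    partitions, otherwise the denominator is zero.
    """
    pairs = Counter((b, g) for b, g, _ in contigs.values())
    bins = Counter(b for b, _, _ in contigs.values())
    genomes = Counter(g for _, g, _ in contigs.values())
    together = sum(comb(n, 2) for n in pairs.values())
    in_bins = sum(comb(n, 2) for n in bins.values())
    in_genomes = sum(comb(n, 2) for n in genomes.values())
    expected = in_bins * in_genomes / comb(len(contigs), 2)
    maximum = (in_bins + in_genomes) / 2
    return (together - expected) / (maximum - expected)
```

Because unassigned contigs never appear in `contigs`, the ARI here describes only the binned portion of the data, which is why the caption pairs it with the fraction of the sample assigned.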

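Figure 3. Taxonomic profiling performance across taxonomic ranks

(a) Relative performance of the profilers at different taxonomic ranks and under different error metrics. (b) The three best-scoring profilers for each metric, with their scores. (c) Absolute recall and precision of each profiler at different taxonomic ranks.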

Figure 3: (a) Relative performance of profilers for different ranks and with different error metrics (weighted Unifrac, L1 norm, recall, precision, and false positives), shown here exemplarily for the microbial portion of the first high complexity sample. Each error metric was divided by its maximal value to facilitate viewing on the same scale and relative performance comparisons. A method’s name is given in red (with two asterisks) if it returned no predictions at the corresponding taxonomic rank. (b) Best scoring profilers using different performance metrics summed over all samples and taxonomic ranks to the genus level. A lower score indicates that a method was more frequently ranked highly for a particular metric. The maximum (worst) score for the Unifrac metric is 38 (= 18 + 11 + 9 profiling submissions for the low, medium and high complexity datasets, respectively), while the maximum score is 190 for all other metrics (= 5 taxonomic ranks * (18 + 11 + 9) profiling submissions for the low, medium and high complexity datasets, respectively). (c) Absolute recall and precision for each profiler on the microbial (filtered) portion of the low complexity data set across six taxonomic ranks. Abbreviations are FS (FOCUS), T-P (Taxy-Pro), MP2.0 (MetaPhlAn 2.0), MPr (Metaphyler), CK (Common Kmers) and D (DUDes).
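Several of the profiling metrics listed in this caption are simple to state in code. The sketch below shows one plausible way to compute recall, precision, false positives, and the L1 norm for a single taxonomic rank, and reproduces the maximum-score arithmetic from panel (b). The `Profile` layout and the function name are assumptions made for illustration, and weighted Unifrac is left out because it needs the taxonomy tree.

```python
from typing import Dict

Profile = Dict[str, float]  # hypothetical layout: taxon name -> relative abundance at one rank


def profile_metrics(truth: Profile, predicted: Profile) -> Dict[str, float]:
    """Presence/absence and abundance metrics for one sample at one taxonomic rank."""
    true_taxa = {t for t, a in truth.items() if a > 0}
    pred_taxa = {t for t, a in predicted.items() if a > 0}
    true_positives = len(true_taxa & pred_taxa)
    false_positives = len(pred_taxa - true_taxa)
    # L1 norm: summed absolute abundance difference over every taxon seen in either profile
    l1_norm = sum(
        abs(truth.get(t, 0.0) - predicted.get(t, 0.0)) for t in true_taxa | pred_taxa
    )
    return {
        "recall": true_positives / len(true_taxa) if true_taxa else 0.0,
        "precision": true_positives / len(pred_taxa) if pred_taxa else 0.0,
        "false_positives": float(false_positives),
        "l1_norm": l1_norm,
    }


# Maximum (worst) rank-sum scores quoted in the caption for panel (b)
submissions = 18 + 11 + 9             # low, medium and high complexity data sets
max_unifrac_score = submissions       # Unifrac is scored once per submission
max_other_score = 5 * submissions     # other metrics are scored at 5 taxonomic ranks
```

With these numbers `max_unifrac_score` is 38 and `max_other_score` is 190, matching the caption.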

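Table 1. Evaluation results for sequence assemblers

Names, underlying principles, and evaluated performance of six commonly used assemblers.

Table 2. Evaluation results for genome and taxonomic binners

Names, underlying principles, evaluated performance, and recommended uses of nine commonly used genome and taxonomic binners.

Table 3. Evaluation results for taxonomic profilers

Names, underlying principles, and evaluation results of ten taxonomic profilers.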

Reference

DOI: https://doi.org/10.1101/099127
Preprint download: http://www.biorxiv.org/content/early/2017/06/12/099127
Corresponding author's Google Scholar page: https://scholar.google.com/citations?user=zJaGqmAAAAAJ
